Clustering of language status by language usage

Ethnologue maintains a database of resources on the world's languages and hosts over 7000 language profiles online. The data were mined from its open website to compile a list of language profiles. We analyse two properties of each language: its usage and its written form (written vs unwritten).

We took the languages that have a description of their language usage and grouped them by language status as defined in the Expanded Graded Intergenerational Disruption Scale (EGIDS). We then compiled a dictionary of the most used words after cleaning the text of English stop words, numbers and punctuation.
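The grouping and word-counting step can be sketched in base R as follows. This is a minimal illustration rather than the pipeline used in the analysis (which relies on quanteda, see the appendix): the three rows, the `profiles` data frame, the `count_terms` helper and the short stop-word list are all made up for the example, and the status strings are assumed to start with the EGIDS code followed by a space.

```r
# Hypothetical rows; the real data come from the Ethnologue profiles.
profiles <- data.frame(
  language_status = c("6a (Vigorous).", "6a (Vigorous).", "8a (Moribund)."),
  language_use = c("Vigorous. Home, village; all ages.",
                   "Vigorous. Positive attitudes.",
                   "Older adults only. Shifting to English."),
  stringsAsFactors = FALSE
)

# Keep only the leading status code (e.g. "6a"), using the appendix regex.
profiles$stat_num <- gsub("^([0-9a-z]+) .*", "\\1", profiles$language_status)

# Build one concatenated document per status code.
docs <- sapply(split(profiles$language_use, profiles$stat_num),
               paste, collapse = " ")

# Tiny illustrative stop-word list (an assumption, not quanteda's list).
stop_words <- c("to", "the", "and", "of", "only", "all")

# Lower-case, strip punctuation and digits, split on whitespace,
# drop stop words, and count the remaining terms.
count_terms <- function(text) {
  clean <- gsub("[[:punct:][:digit:]]+", " ", tolower(text))
  words <- strsplit(trimws(clean), "\\s+")[[1]]
  words <- words[!words %in% stop_words]
  sort(table(words), decreasing = TRUE)
}

term_counts <- lapply(docs, count_terms)
term_counts[["6a"]]
```

In this toy input, "vigorous" is the most frequent term for status 6a because it appears in both of the 6a descriptions.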

We then filtered out high-frequency terms that carry no significant meaning; the frequency cut-off was chosen to be 1400. Among the top 20 terms shown below, this removes "use", "also", "ages" and "used".

topfeatures(lang.use.dfm, 20)
##        use       also       ages       used  attitudes   positive 
##       4983       4539       1727       1482       1373       1205 
##         l2   vigorous    domains       home    english        eng 
##       1198        915        878        858        790        768 
##   language   children     adults      older    spanish        spa 
##        626        625        578        497        491        465 
## especially   speakers 
##        423        362
lang.use.dfm<-dfm_trim(lang.use.dfm, max_count=1400)
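To make the effect of the cut-off concrete, the sketch below applies the same threshold to the first eight counts printed above. It uses a plain named vector instead of a dfm, and it assumes the cut-off is inclusive (terms with a count of exactly 1400 would be kept); none of the listed terms sits on that boundary.

```r
# Counts copied from the topfeatures() output above.
counts <- c(use = 4983, also = 4539, ages = 1727, used = 1482,
            attitudes = 1373, positive = 1205, l2 = 1198, vigorous = 915)

max_count <- 1400

# Terms above the cut-off are dropped; the rest are kept.
dropped <- names(counts)[counts > max_count]
kept    <- counts[counts <= max_count]

dropped      # "use" "also" "ages" "used"
names(kept)  # "attitudes" "positive" "l2" "vigorous"
```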

We then compute the similarity in language usage between each pair of EGIDS levels using the extended Jaccard index and apply hierarchical clustering with Ward’s method to see whether any EGIDS levels are similar enough to be clustered together.
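For reference, the extended Jaccard (Tanimoto) index between two count vectors x and y is usually defined as x·y / (|x|² + |y|² − x·y). The sketch below implements that definition in base R on made-up term counts; it assumes that quanteda's "eJaccard" method follows the same formula, and the `ejaccard` helper and the three vectors are illustrative only.

```r
# Extended Jaccard (Tanimoto) similarity between two numeric vectors.
ejaccard <- function(x, y) {
  xy <- sum(x * y)
  xy / (sum(x^2) + sum(y^2) - xy)
}

# Hypothetical term counts for three EGIDS levels.
a <- c(attitudes = 3, home = 2, shifted = 0)
b <- c(attitudes = 1, home = 2, shifted = 0)
d <- c(attitudes = 0, home = 0, shifted = 4)

ejaccard(a, b)  # 7 / (13 + 5 - 7) = 7/11
ejaccard(a, d)  # 0: no shared terms
ejaccard(a, a)  # 1: identical vectors
```

The index is 1 for identical vectors and 0 for vectors with no terms in common, so `1 - similarity` can serve as a dissimilarity for clustering.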

Using the words that describe how the languages are used, we can see that EGIDS levels that are close to each other have similar usage descriptions. The clusters are very close to the original EGIDS groupings, where 0-4 are Institutional, 5 is Developing, 6a is Vigorous (usage), 6b-7 are In Trouble, 8a-9 are Dying and 10 is Extinct.

lang.use.dist<-textstat_simil(lang.use.dfm, method="eJaccard")
corrplot(as.matrix(lang.use.dist), hclust.method = "ward.D2", order="hclust", addrect=4)

The dendrogram below shows the hierarchical clustering, with a dashed line drawn at height 1.0.

h<-hclust(as.dist(1-as.matrix(lang.use.dist)), method="ward.D2")
plot(h)
abline(h=1.0, lty=2)
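As a small illustration of the two steps involved (turning similarities into distances, then cutting the tree into a fixed number of groups), the sketch below runs the same calls on a made-up 4 x 4 similarity matrix. The document names A to D and the `toy.*` objects are hypothetical and unrelated to the EGIDS data.

```r
# Made-up similarity matrix: A and B are alike, C and D are alike.
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))

# Convert similarity to dissimilarity and cluster with Ward's method.
toy.h <- hclust(as.dist(1 - sim), method = "ward.D2")

# Cut the tree into two groups and list the members of each group.
toy.group <- cutree(toy.h, k = 2)
split(names(toy.group), toy.group)
```

Here A and B end up in one group and C and D in the other, which mirrors how `cutree(h, k=4)` assigns each EGIDS level to one of four clusters.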

group<-cutree(h, k=4)
topStatusFeatures<-sapply(seq(1,max(group)), function(x) {
  topfeatures(dfm_select(lang.use.dfm, documents=names(group[which(group==x)])))
}, simplify = F)
names(topStatusFeatures) <- sapply(seq(1,max(group)), function(x) {paste(sort(names(group[which(group==x)])), collapse=",")})
topStatusFeatures[c(4,1,2,3)]
## $`1,2,3,4`
##        l2 attitudes  positive   domains   english       eng  vigorous 
##       246       177       167       142       122       119       111 
##  language      home  european 
##        68        57        52 
## 
## $`5,6a,6b`
## attitudes  positive        l2  vigorous      home   domains   english 
##      1085       976       877       801       688       685       446 
##  children       eng  language 
##       439       433       366 
## 
## $`7,8a,8b`
##   adults    older  english      eng shifting     wurm speakers children 
##      332      332      187      182      176      166      149      147 
## language   mainly 
##      146      144 
## 
## $`10,9`
##        shifted       language        english            eng     portuguese 
##            116             46             35             34             25 
##            por revitalization          golla       speakers          speak 
##             23             17             16             15             14

Observation

From each EGIDS cluster, we list the top words. Languages in EGIDS 1-4 are often used as a second language (L2), and their speakers generally have positive attitudes towards them. They are also used in certain or all domains of their society.

For EGIDS 5, 6a and 6b, speakers also have positive attitudes, but use as a second language is less prominent ("l2" ranks below "attitudes" and "positive"). These languages are still used vigorously, but use may be limited to certain domains or to the home.

For EGIDS 7, 8a and 8b, the languages are mainly used by adults or the older generation; speakers seem to also know English and are shifting to other languages.

Lastly, for EGIDS 9 and 10, speakers have shifted to another language, probably English or Portuguese.

Comparing written vs unwritten languages by each cluster

We look at each language's written form and categorise languages that have some writing script as “Written”, versus “Unwritten” for those declared as such in the qualitative variables of the language profile.

We analyse the ratio of written to unwritten languages in each of the clusters from the first section.

kable(data.ele)
         Unwritten  Written  Total
1,2,3,4          7      474    481
5,6a,6b        653     2432   3085
7,8a,8b        257      281    538
10,9            48       22     70
plotPie()

Observation

The pie charts show that widely used languages (EGIDS 1-4 and EGIDS 5, 6a and 6b) mostly have some form of written text: about 98.5% and 78.8% of the languages in those clusters, respectively. By contrast, a much higher share of the languages in the 7, 8a, 8b cluster and in the 9, 10 cluster (dying and extinct languages) have no written text: about 47.8% and 68.6%, respectively.

This may indicate that well-adopted languages are more developed, with a written form that can be institutionally adopted and widely transmitted.
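The written share of each cluster can be recomputed directly from the counts in the table above. The sketch below re-enters those counts by hand in a small matrix, so it does not depend on the `data.ele` object; the `counts` and `written.share` names are introduced for this example only.

```r
# Counts copied from the written/unwritten table above.
counts <- matrix(c(  7,  474,
                   653, 2432,
                   257,  281,
                    48,   22),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("1,2,3,4", "5,6a,6b", "7,8a,8b", "10,9"),
                                 c("Unwritten", "Written")))

# Percentage of written languages per cluster, rounded to one decimal.
written.share <- round(100 * counts[, "Written"] / rowSums(counts), 1)
written.share
# 1,2,3,4 5,6a,6b 7,8a,8b    10,9
#    98.5    78.8    52.2    31.4
```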

Appendix: Code blocks

require(dplyr)
require(quanteda)
require(corrplot)

lang.prop<-read.csv("../input/ethnologue.csv", na.strings = "")
lang.prop.complete<-filter(lang.prop, complete.cases(language_use, classification))
lang.prop.complete<-lang.prop.complete %>% mutate(stat_num=gsub("^([0-9a-z]+) .*", "\\1", language_status))

# Concatenate all language_use descriptions into one document per EGIDS status code
lang.use.cat<-do.call("rbind", sapply(unique(lang.prop.complete$stat_num), function(s) {
  idx<-which(lang.prop.complete$stat_num == s)
  data.frame(cat=s, language_use = paste(lang.prop.complete$language_use[idx], collapse = " "),
             stringsAsFactors = F)
}, simplify = F))
lang.use.corpus<-corpus(lang.use.cat$language_use)
docnames(lang.use.corpus) <- lang.use.cat$cat
lang.use.dfm <- dfm(lang.use.corpus, 
                    remove_numbers=T,
                    remove = stopwords(), stem = F, remove_punct = TRUE)
require(knitr)
# Drop rows with no writing information (logical indexing also works when no value is missing)
lang.prop.complete.writing<-lang.prop.complete[!is.na(lang.prop.complete$writing),]
lang.prop.complete.writing<-lang.prop.complete.writing %>% mutate(writing.script=gsub("(^[a-zA-Z]+) .*", "\\1", writing)) %>% mutate(writing.script=ifelse(writing.script!="Unwritten","Written", writing.script))
data<-sapply(seq(1,max(group)), function(x) {
  sel<-which(lang.prop.complete.writing$stat_num %in% names(group[group==x]))
  table(lang.prop.complete.writing[sel, "writing.script"])
}, simplify = F)
names(data) <- sapply(seq(1,max(group)), function(x) {paste(sort(names(group[which(group==x)])), collapse=",")})
data<-do.call("rbind", data[c(4,1,2,3)]) 
r<-row.names(data)
data.ele<-mutate (as.data.frame(data), Total=rowSums(data))
row.names(data.ele) <-r
plotPie<-function() {
  require(plotly)
  colortone<-list(
    colors = c(rgb(253,174,97, maxColorValue=255), rgb(43,131,186, maxColorValue=255)))
  subTitleStyle<-list(
    font = list(family = "Courier New, monospace", size = 16, color = "black"),
    xref = "paper",
    yref = "paper",
    yanchor = "bottom",
    xanchor = "center",
    align = "left",
    showarrow = FALSE)
  plot_ly() %>%
    add_pie(data = data.frame(n=data[1,], doe=names(data[1,])), labels = ~doe, values = ~n,
            textposition="outside", direction="clockwise", sort=F, rotation=0,
            marker=colortone,
            name = "Status: 1,2,3,4", domain = list(x = c(0, 0.4), y = c(0.6, 1))) %>%
    layout(annotations=append(list(text="Status: 1,2,3,4", x=0.1, y=0.5), subTitleStyle)) %>%
    add_pie(data = data.frame(n=data[2,], doe=names(data[2,])), labels = ~doe, values = ~n,
            textposition="outside", direction="clockwise", sort=F, rotation=0,
            name = "Status: 5,6a,6b", domain = list(x = c(0.6, 1), y = c(0.6, 1))) %>%
    layout(annotations=append(list(text="Status: 5,6a,6b", x=1, y=0.5), subTitleStyle)) %>%
    add_pie(data = data.frame(n=data[3,], doe=names(data[3,])), labels = ~doe, values = ~n,
            textposition="outside", direction="clockwise", sort=F, rotation=180,
            name = "Status: 7,8a,8b", domain = list(x = c(0, 0.4), y = c(0.4, 0))) %>%
    layout(annotations=append(list(text="Status: 7,8a,8b", x=0.1, y=-0.1), subTitleStyle)) %>%
    add_pie(data = data.frame(n=data[4,], doe=names(data[4,])), labels = ~doe, values = ~n,
            textposition="outside", direction="clockwise", sort=F, rotation=180,
            name = "Status: 9,10", domain = list(x = c(0.6, 1), y = c(0.4, 0))) %>%
    layout(annotations=append(list(text="Status: 9,10", x=1, y=-0.1), subTitleStyle)) %>%
    layout(title = 'Written languages vs unwritten languages', showlegend = T,
                  xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = F),
                  yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = F))
}